ABSTRACT

This paper deals with test fairness regarding a test 
consisting of two parts: (1) a "common" section, taken by all 
students; and (2) a "variable" section, in which some students may 
answer a different set of questions from other students. For example, 
a test taken by several thousand students each year contains a common 
multiple-choice portion and a common essay portion but also a 
variable essay portion, in which the test-taker may choose to answer 
any one of five questions. On this test the questions that the 
test-taker may choose from are intended to be of equal difficulty. 
When the scoring has been completed and the results tabulated, the 
data occasionally suggest that two or more essay questions may not 
have been of equal difficulty. If there had been no reason to 
believe, a priori, that the questions on the variable portion were of 
equal difficulty, the scores would need to be adjusted in such a 
situation. Two possible assumptions are considered, and problems arise 
with both: option A, a covariance adjustment, leads farthest away from 
the assumption of equal difficulty when the evidence against it is 
weakest; and option B, which makes no adjustment, is equivalent to 
assuming that the questions in the variable portion are, in fact, 
equally difficult. A compromise is proposed that is closer to option 
A when the common portion predicts the variable portion accurately 
and closer to B when it does not. (PN) 
Adjusting Scores on Examinations Offering a Choice of Questions 

Samuel A. Livingston 
Educational Testing Service 

This paper is about fairness in testing. It deals with a particular 
type of test, one that consists of two or more parts. At least 
one part is a "common" section, taken by all the students. But the test 
also contains at least one "variable" part, in which some students may 
answer a different set of questions from other students. Often these tests 
allow the student a choice of questions on the variable portion of the test. 
For example, one test taken by several thousand students each year contains 
a common multiple-choice portion and a common essay portion but also a 
variable essay portion, in which the test-taker may choose to answer any one 
of five questions. On this test (and possibly on other tests that allow a 
choice of questions) the questions that the test-taker may choose from are 
intended to be of equal difficulty. In fact, the developers of the test 
work very hard to produce questions of comparable difficulty, and the 
scoring leaders work very hard to establish and maintain scoring standards 
that are comparable across questions. 

Nevertheless, when the scoring has been completed and the results 
tabulated, the data occasionally suggest that two or more essay questions 
may not have been of equal difficulty. Consider the example in Table 1. A 
comparison of the groups answering questions 5 and 6 should cause us at 
least to question an assumption of equal difficulty. Group 5, on the basis 
of the common portions, appears to be as able as the other groups, but their 
scores on the variable portion average a third of a standard deviation 
lower. Group 6 appears, on the basis of the common portions, to be somewhat 
weaker than the other groups, but their scores on the variable portion 
average slightly higher. 



If we had no reason to believe, a priori, that the questions on the 
variable portion were of equal difficulty, we would surely want to adjust 
the scores in such a situation. We would assume that groups of students 
whose performance on the common portion indicates they are of equal ability 
should also receive similar scores on the variable portion. Probably the 
simplest way to make an adjustment based on this assumption would be to 
estimate a "question effect" for each question and subtract this estimated 
"question effect" from the student's score on the variable portion. This 
kind of adjustment would completely disregard all the attempts to make 
the questions on the variable portion equally difficult. 
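The simplest adjustment described above can be sketched in a few lines. This is only an illustration under stated assumptions: the paper does not say how the "question effect" is estimated, so the sketch assumes a basic analysis-of-covariance estimate (a single within-group slope of y on x, with each effect measured from the grand mean). The function names and the data layout are hypothetical.

```python
from statistics import mean

def within_group_slope(groups):
    """Pooled within-group slope of y on x (assumed common to all groups)."""
    sxy = sxx = 0.0
    for xs, ys in groups.values():
        mx, my = mean(xs), mean(ys)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def question_effects(groups):
    """groups maps question -> (common-portion scores x, variable-portion scores y).

    Assumed estimate: the group's mean y, minus the grand mean y, minus the
    part of that gap explained by the group's standing on x.
    """
    b = within_group_slope(groups)
    grand_x = mean(x for xs, _ in groups.values() for x in xs)
    grand_y = mean(y for _, ys in groups.values() for y in ys)
    return {q: (mean(ys) - grand_y) - b * (mean(xs) - grand_x)
            for q, (xs, ys) in groups.items()}

def adjust_scores(groups):
    """Subtract each question's estimated effect from its students' y scores."""
    effects = question_effects(groups)
    return {q: [y - effects[q] for y in ys] for q, (_, ys) in groups.items()}
```

With two groups that have identical x scores but y scores two points apart, this sketch attributes the whole gap to the questions and removes it, which is exactly the behavior the paragraph above warns about.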

Of course, we could use a much more sophisticated type of adjustment. 
For example, instead of conditioning on the total score from the common 
portion, we could condition on some combination of subscores. Or we could 
condition on a weighted composite of the items in the common portion, 
choosing weights that maximize the difference between the groups of students 
choosing different questions on the variable portion of the test. Instead 
of estimating a constant question effect, we could adjust for differences in 
the conditional means and the conditional standard deviations, and maybe 
some higher moments of the conditional distributions. Stating this approach 
as generally as possible, we would condition on some function of the 
response pattern from the common portion, which would serve as a common 
measure of ability. Then we would assume that some characteristics of the 
distribution of scores on any given question in the variable portion would 
be the same, in some specified way, for all groups of students of equal 
ability, as indicated by the common portion. In particular, we would assume 
that the scores on any question in the variable portion would have been the 
same for the students who did not answer the question as they were for the 
students who did answer it, when we condition on the common ability measure. 
But even with this very flexible approach, we would still be disregarding 
all the attempts to create questions of equal difficulty on the variable 
portion. 

This approach is presented in Table 2 as "Option A". We cannot observe 
the responses of Group 2 to question 1. If Group 2 had answered question 1, 
how would they have performed, in comparison to Group 1? To answer this 
question, we condition on the common portion and then assume that no further 
ability difference exists between the two groups. What might make us 
uncomfortable about such an approach? Consider the case in which the 
responses to the common portion do not do very well at predicting scores on 
the questions in the variable portion. Figure 1 presents a very simple 
example, using just the total score on the common portion as the predictor. 
The solid ellipses represent the data we can observe; the dashed ellipses 
represent the distributions we impute under this assumption. You can see 
how the assumption implies that Group 2, with much lower scores on the 
common portion, would have done nearly as well as Group 1, if they had taken 
question 1. 
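A small numerical sketch may make the point concrete. It assumes the simple case described above, with the total common-portion score as the only predictor and a linear regression in Group 1; the function name and all the numbers are hypothetical and are not taken from the paper's data.

```python
def option_a_imputed_mean(mean_x1, mean_y1, slope1, mean_x2):
    """Mean that Option A would impute to Group 2 on question 1.

    Assumes Group 1's regression of y on x is a straight line through
    (mean_x1, mean_y1) with slope slope1, and that Group 2 follows the
    same line once we condition on x.
    """
    return mean_y1 + slope1 * (mean_x2 - mean_x1)

# Hypothetical values: Group 2 scores 10 points lower on the common portion.
weak = option_a_imputed_mean(mean_x1=60.0, mean_y1=12.0, slope1=0.02, mean_x2=50.0)
strong = option_a_imputed_mean(mean_x1=60.0, mean_y1=12.0, slope1=0.20, mean_x2=50.0)
print(weak, strong)
```

When the slope is weak, the imputed mean for Group 2 stays close to Group 1's observed mean of 12 despite the 10-point gap on the common portion; with the steeper slope the imputed mean drops by two full points.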

There is another problem with the conditionally-equal-ability assumption 
of Option A. If the relationship between the common portion and the 
variable portion is weak, the reason may well be that the two portions 
measure somewhat different skills. Yet this is exactly the case in which 
the imputed score distribution for Group 2 on question 1 will be farthest 
from their actual score distribution on question 2. That is, a weak 
relationship with the covariate leads to a large adjustment. Remember, we 
have some non-statistical information telling us that the questions are at 
least approximately equal in difficulty. The traditional approach of Option 
A leads us farthest away from the assumption of equal difficulty when the 
evidence against it is weakest. 

If Option A is not fully satisfactory, what about Option B? Option B 
says to assume that if Group 2 had taken question 1, they would have done 
just as well on question 1 as they actually did on question 2. This 
assumption is equivalent to assuming that the questions on the variable 
portion are, in fact, equally difficult. Under Option B, we would never 
adjust the scores on the variable portion, no matter what the scores on the 
common portion looked like. This assumption might not make us too 
uncomfortable in a situation like that of Figure 1, but look at Figure 2. 
In this example, the scores on the common portion are strongly related to 
scores on the variable portion. Yet Group 2, with much lower scores on the 
common portion, gets much higher scores on the variable portion. 

What we need is some sort of compromise between the two approaches I 
have labeled Option A and Option B, preferably a compromise that depends on 
the data. We would like a solution that is closer to Option A when the 
common portion predicts the variable portion accurately and closer to Option 
B when it does not. The only solution I have been able to come up with is 
one that requires a subjective decision. This approach says: Look at the 
difference between conditional means, compare it with the size of the 
conditional standard deviation, and ask yourself, "How big a difference am I 
willing to believe is a genuine difference in ability between the groups?" 
Option A, the basic covariance adjustment, says, "None; any difference in 
conditional means must be the result of differences in question difficulty 
(or scoring standards, etc.)." Option B, which leads to no adjustment, 
says, "All of it; any difference I observe must be a genuine ability 
difference, even though we are comparing students who are equal on x." I 
say, why force yourself to choose one or the other of these extreme 
positions? Why not say, "I will believe that a difference up to one, or 
two, or three conditional standard deviations could be due to genuine 
ability differences. I will adjust so as to remove any difference beyond 
that." 

This approach does have the property of producing an adjustment that is 
larger when x predicts y more accurately. The more accurate the prediction, 
the smaller the conditional standard deviation, and the smaller the 
allowable difference between conditional means. 
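To illustrate how the allowable difference shrinks, the sketch below assumes a linear, homoscedastic regression of y on x, for which the conditional standard deviation equals the standard deviation of y times the square root of (1 - r^2), where r is the correlation between x and y. The function name, the parameter name `d_star` (the number of conditional standard deviations one is willing to believe), and the numbers are hypothetical.

```python
import math

def allowable_difference(sd_y, r, d_star):
    """Largest gap in conditional means, in score points, left unadjusted.

    Assumes the conditional (residual) SD is sd_y * sqrt(1 - r**2),
    which holds for a linear, homoscedastic regression of y on x.
    """
    return d_star * sd_y * math.sqrt(1.0 - r * r)

# Hypothetical values: SD of y is 4 points, one conditional SD is tolerated.
for r in (0.3, 0.6, 0.8):
    print(r, round(allowable_difference(sd_y=4.0, r=r, d_star=1.0), 2))
```

As r rises, the tolerated gap falls, so a given observed difference leaves a larger remainder to be adjusted away.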

There is one feature of an adjustment based on this approach that runs 
counter to most people's idea of fairness, but we can correct the problem 
with a small modification. The problem is this: Suppose we apply the 
principle strictly, adjusting away any differences beyond the amount we have 
specified in terms of the conditional standard deviation. Then we could 
have a situation, in one of the groups, where two students could have the 
same unadjusted y score, but the student with the higher x score could 
receive a lower adjusted y score. To prevent this kind of unfairness we can 
introduce an additional constraint: the size of the adjustment, that is, 
the number of points to be added to or subtracted from a student's y score, 
must be the same for all students answering the same question on the 
variable portion. The resulting adjustment would take the form of a 
constant for each group, to be added to (or subtracted from) the y scores of 
all students in the group. 



In practice, the adjustment might be based on a much simpler model, 
treating the regression of y on x in each group as linear and homoscedastic. 
Table 3 shows the equations that describe this adjustment in terms of the 
observed x and y scores. We would regress y on x in each group to get an 
equation for y-hat, the conditional mean, and an estimate of the residual 
standard deviation. We would then compute a pooled regression equation for 
y-hat, weighting each of the group expressions by the number of students in 
the group. Finally, we would compute, for each group, the difference 
between the group y-hat and the pooled y-hat, evaluated at the group mean 
x score and divided by the residual standard deviation for that group. If 
the absolute value of this number 
were smaller than the value we specified as the biggest difference we would 
believe, we would make no adjustment. If it were larger than the specified 
value, we would subtract off the specified value, and the remainder would be 
the size of the adjustment we would make to the score of each student in the 
group. 
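The procedure of Table 3 can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the function names and data layout are hypothetical, the residual standard deviation is assumed to use n - 2 degrees of freedom (the paper does not say), and each group needs at least three students with some residual variation. Table 3 states the adjustment as the remainder of the standardized index; the additional `shift_points` value, which multiplies that remainder by the group's residual standard deviation to return to score points, is an assumption of this sketch.

```python
import math
from statistics import mean

def fit_group(xs, ys):
    """Least-squares regression of y on x in one group: (a, b, residual SD, n)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, math.sqrt(sse / (n - 2)), n  # n - 2 is an assumption

def group_adjustments(groups, d_star):
    """groups maps question -> (x scores, y scores); d_star is the largest
    standardized difference accepted as a genuine ability difference."""
    fits = {q: fit_group(xs, ys) for q, (xs, ys) in groups.items()}
    total_n = sum(n for _, _, _, n in fits.values())
    # Step 2: pooled line, weighting each group's line by its size.
    a_pool = sum(n * a for a, _, _, n in fits.values()) / total_n
    b_pool = sum(n * b for _, b, _, n in fits.values()) / total_n
    result = {}
    for q, (xs, _) in groups.items():
        a, b, s, _ = fits[q]
        mx = mean(xs)
        # Step 3: standardized difference at the group's mean x score.
        d = ((a + b * mx) - (a_pool + b_pool * mx)) / s
        # Step 4: keep only the part of d beyond d_star, with d's sign.
        excess = math.copysign(max(abs(d) - d_star, 0.0), d)
        result[q] = {
            "d": d,
            "shift_sd_units": -excess,    # constant added to y, as in Table 3
            "shift_points": -excess * s,  # assumed conversion to score points
        }
    return result
```

Because the shift is a single constant per group, two students in the same group with the same unadjusted y score always receive the same adjusted y score, whatever their x scores.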

What makes the problem of adjusting for different questions unlike the 
problem of adjusting for different readers? Certainly there are 
similarities. In both cases there has been a lot of effort to make 
adjustments unnecessary, and yet the data may suggest that there is still 
room for improvement. Just as different questions may measure different 
knowledge and skills, readers may differ in the types of knowledge they 
consider most important. But there is one important difference between the 
two situations. Papers are assigned to readers by a process which can be 
assumed to be approximately random with respect to students' ability. 
Therefore, it is perfectly reasonable to assume that the conditional 
distributions of essay scores (conditional on some other part of the 
test) should not differ systematically from one reader to another, beyond 
what we might expect from sampling variation. But when students are allowed 
to choose their own questions to answer, there could very well be systematic 
differences in the ability measured by the variable portion, even when we 
condition on the common portion. The question is how large an ability 
difference we are willing to believe is genuine. Statistics cannot answer 
this question for us, but they can give us a way to express our answer and 
translate it into a score adjustment that is consistent with what we 
believe. 
Table 1. Example of Data from Common and Variable Portions of an 
Examination: Deviation of Each Group Mean from Combined Mean, in 
Terms of Combined Standard Deviation. 

                              Group Selecting Variable Question 

                                2        3        4        5        6 

Common multiple-choice        -0.12    +0.03    +0.04    +0.10    -0.35 
  portion 

Common essay portion          -.029    +0.05    +0.18     0.00    -0.?0 

Variable essay portion        -0.12    +0.08    -0.03    -0.34    +0.14 

Number of students            3,411   38,445    1,390   10,382    5,180 
Table 2. Two possible assumptions. 

Let $Y_1$ = score on variable question 1 
    $Y_2$ = score on variable question 2 
    $x$   = vector of responses on common portion 
    $F_1$ = distribution in group taking variable question 1 
    $F_2$ = distribution in group taking variable question 2 

              Question 1               Question 2 

Group 1       observed                 unobserved 
              $F_1(Y_1 \mid x)$        $F_1(Y_2 \mid x)$ 

Group 2       unobserved               observed 
              $F_2(Y_1 \mid x)$        $F_2(Y_2 \mid x)$ 

Option A: Assume 

    unobserved $F_2(Y_1 \mid x)$ = observed $F_1(Y_1 \mid x)$ 
    unobserved $F_1(Y_2 \mid x)$ = observed $F_2(Y_2 \mid x)$ 

Implies that, conditional on x, groups 1 and 2 are equally able. 

Option B: Assume 

    unobserved $F_1(Y_2 \mid x)$ = observed $F_1(Y_1 \mid x)$ 
    unobserved $F_2(Y_1 \mid x)$ = observed $F_2(Y_2 \mid x)$ 

Implies that questions 1 and 2 are equally difficult. 
Table 3. Proposed solution: 

1. In each group $i$, regress $y_i$ on $x$ to get 
   $\hat{y}_i \mid x = a_i + b_i x$ and an estimate of the residual 
   standard deviation $s(y_i \cdot x)$. 

2. Weighting each expression for $\hat{y}_i$ by the number of students 
   $n_i$ in the group, compute a pooled regression equation 

   $$\hat{y}_{pooled} \mid x = \frac{\sum_i n_i (a_i + b_i x)}{\sum_i n_i} = a_{pooled} + b_{pooled}\, x$$ 

3. For each group, compute a standardized difference index at the group 
   mean x score $\bar{x}_i$: 

   $$d_i = \frac{(\hat{y}_i \mid \bar{x}_i) - (\hat{y}_{pooled} \mid \bar{x}_i)}{s(y_i \cdot x)}$$ 

4. Let $d^*$ represent the maximum allowable standardized difference. 
   If $|d_i| \le d^*$, make no adjustment. 
   If $d_i > d^*$, let the adjusted $y_i$ be $y_i - (d_i - d^*)$. 
   If $d_i < -d^*$, let the adjusted $y_i$ be $y_i + (|d_i| - d^*)$. 
Figure 1. Option A: Covariance adjustment. (Hypothetical example) 
[Scatter diagram: vertical axis, Variable Portion; horizontal axis, 
Common Portion. Ellipses: Group 1, Question 1 (observed); Group 2, 
Question 1 (imputed); Group 2, Question 2 (observed); Group 1, 
Question 2 (imputed).] 

Figure 2. Option B: No adjustment. (Hypothetical example) 
[Scatter diagram: vertical axis, Variable Portion; horizontal axis, 
Common Portion. Ellipses: Group 2, Question 2 (observed); Group 2, 
Question 1 (imputed); Group 1, Question 1 (observed); Group 1, 
Question 2 (imputed).] 